Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make format prefix optional for format options in COPY #9723

Merged
merged 5 commits into from
Mar 22, 2024

Conversation

tinfoil-knight
Copy link
Contributor

Which issue does this PR close?

Closes #9716 .

Rationale for this change

Before this PR
User has to prefix all format options they pass to the COPY command with format. For eg:

COPY source_table
to 'test_files/scratch/copy/table_csv_with_options'
STORED AS CSV OPTIONS (
	'format.has_header' false,
	'format.compression' uncompressed,
	'format.datetime_format' '%FT%H:%M:%S.%9f',
	'format.delimiter' ';',
	'format.null_value' 'NULLVAL'
);

This is repetitive & different from what Datafusion 36.0.0 offered. (See issue description for more details)

After this PR

User can use the format options for the COPY command without prefixing them with format:

COPY source_table to 'test_files/scratch/copy/format_table.csv'
OPTIONS (
    has_header false,
    compression xz,
    datetime_format '%FT%H:%M:%S.%9f',
    delimiter ';',
    null_value 'NULLVAL'
);

What changes are included in this PR?

We're prefixing keys passed in options with format if they don't contain a namespace (signified by presence of .).

Are these changes tested?

Yes. New tests are added to verify that options not prefixed by format work.

Are there any user-facing changes?

Current docs have the legacy format so updates aren't needed.

@github-actions github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Mar 21, 2024
options.insert(renamed_key.to_lowercase(), value_string.to_lowercase());
} else {
options.insert(key.to_lowercase(), value_string.to_lowercase());
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for @devinjdangelo

In the previous similar PR, we were only renaming keys when the file type matches JSON, CSV or PARQUET.
See https://github.com/apache/arrow-datafusion/pull/9594/files#diff-bb98aeaec478d9576e33911f252745961f57e38d25f6e9d13803894dfbad25d5R876-R881

I decided to not do this now since it'd seem weird if some file types need a format prefix & others don't.

Let me know if you think differently.


Sidenote: Other file formats don't have any supported format options currently.

@tinfoil-knight
Copy link
Contributor Author

1 test is failing on CI for multiple test suites. Wasn't able to reproduce the failure locally.

@alamb
Copy link
Contributor

alamb commented Mar 21, 2024

1 test is failing on CI for multiple test suites. Wasn't able to reproduce the failure locally.

I think it is due to the new Rust release, #9724 (which has since been fixed)

@alamb
Copy link
Contributor

alamb commented Mar 21, 2024

I merged up to main to get the CI passing

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tinfoil-knight -- this PR is most appreciated and seems to do exactly what it says.

I pushed one additional commit to verify that the error message for unknown cases is reasonable and another to change the comments in the test that avoiding the .format is desired

cc @devinjdangelo @ozankabak @metesynnada -- I think this PR is quite nice and elegantly supports a nicer UX while retaining the unified configuration support

@alamb
Copy link
Contributor

alamb commented Mar 22, 2024

I plan to merge this PR later today unless anyone would like additional time to review

@alamb alamb merged commit 4913a00 into apache:main Mar 22, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 22, 2024

Thanks again @tinfoil-knight

@tinfoil-knight tinfoil-knight deleted the 9716-format-prefix branch March 22, 2024 19:50
// compatibility.

let renamed_key = format!("format.{}", key);
options.insert(renamed_key.to_lowercase(), value_string.to_lowercase());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tinfoil-knight @alamb, just would like to know the history of why we would make the value string lower case, I am encountering this case issue in another PR as described https://github.com/apache/datafusion/pull/10792/files#r1632877093. Thanks :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change was originally added in this PR: #9382

It seems like the author of that PR was trying to consolidate case changes (using to_lowercase) in multiple places to just a single place.

For now, you could add an exclusion for the access token option that is case sensitive in your PR so you can continue your work.

Ideally, we should probably stop lower-casing the options and handle case-sensitivity in the specific reader.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having keys lowercased makes sense for standardization purposes, but I'm not sure why we are also lowercasing option values. It may either be because (1) of the same standardization concerns, or (2) by mistake.

The most flexible way would be to make value (not key) standardization logic customizable, with the default being lowercase. If we had that, users like @xinlifoobar would be able to avoid standardization modifications. Other users who want other kinds of standardizations would be able to override it etc.

@tinfoil-knight, would you be interested in thinking about how we can do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened an issue here: #10853

Making the value standardization logic customizable makes sense. Integrations should receive the original value and be able to decide what to do with it.

I'll attempt a fix for the issue this weekend if it hasn't already been picked up by then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: All formatting options in COPY commands require format. prefix, but did not in DataFusion 36.0.0
4 participants